1 Introduction

Airbnb is among the most frequently used platforms for booking short-term rentals worldwide. In this analysis, we put ourselves in the shoes of a tech-savvy couple that is planning a trip to Berlin and wants to book an apartment via Airbnb. Using city-specific Airbnb data, the goal of the analysis is to find a regression model that predicts the price this couple would have to pay for a 4-night stay at an Airbnb apartment.

There are three main steps to this analysis: (i) data exploration and feature selection, (ii) model selection and validation, and (iii) a brief summary of findings and recommendations.

2 Data Exploration and Feature Selection

First, we import the relevant libraries and define some of the basic settings for the analysis.

Next, we load the relevant data from insideairbnb.com. We cache this data so that it does not download every time that the document is knitted.
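The caching idea can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `load_cached()`, the temporary cache file, and the stand-in `fetch_listings()` are hypothetical names and are not taken from the original analysis, and the actual download call is not reproduced here. In an R Markdown document, the knitr chunk option `cache = TRUE` is another way to get a similar effect.

```r
# Hypothetical helper: read from a local cache if present,
# otherwise run the expensive fetch once and store the result.
load_cached <- function(cache_path, fetch_fn) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))   # cache hit: no download
  }
  data <- fetch_fn()              # cache miss: fetch the data
  saveRDS(data, cache_path)
  data
}

# Stand-in for the real download (assumption); counts how often it runs.
n_downloads <- 0
fetch_listings <- function() {
  n_downloads <<- n_downloads + 1
  data.frame(id = c(2015, 3176), price = c("$77.00", "$90.00"))
}

cache_file <- tempfile(fileext = ".rds")
first  <- load_cached(cache_file, fetch_listings)  # triggers the fetch
second <- load_cached(cache_file, fetch_listings)  # served from the cache
```

After both calls, the stand-in fetch function has run only once, which is the behaviour we want when the document is knitted repeatedly.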

Now that the data is loaded, it helps to get a feel for the different variables. This part of the analysis is known as Exploratory Data Analysis. There are three substeps to this:

2.1 Looking at the raw values via the glimpse() command

This tells us that we are looking at more than 18k Airbnb rentals in Berlin, for which we have 74 variables. The glimpse() output also tells us that the variables come in all kinds of formats and likely require some manipulation before the actual analysis. For instance, “host_acceptance_rate” is stored in character format even though it is clearly a numeric variable.

Rows: 18,288
Columns: 74
$ id                                           <dbl> 2015, 3176, 7071, 9991, 1~
$ listing_url                                  <chr> "https://www.airbnb.com/r~
$ scrape_id                                    <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped                                 <date> 2021-09-22, 2021-09-22, ~
$ name                                         <chr> "Berlin-Mitte Value! Quie~
$ description                                  <chr> "Great location!  <br />3~
$ neighborhood_overview                        <chr> "It is located in the for~
$ picture_url                                  <chr> "https://a0.muscache.com/~
$ host_id                                      <dbl> 2217, 3718, 17391, 33852,~
$ host_url                                     <chr> "https://www.airbnb.com/u~
$ host_name                                    <chr> "Ion", "Britta", "BrightR~
$ host_since                                   <date> 2008-08-18, 2008-10-19, ~
$ host_location                                <chr> "Key Biscayne, Florida, U~
$ host_about                                   <chr> "Isn’t sharing economy gr~
$ host_response_time                           <chr> "within an hour", "a few ~
$ host_response_rate                           <chr> "100%", "40%", "100%", "N~
$ host_acceptance_rate                         <chr> "91%", "100%", "N/A", "0%~
$ host_is_superhost                            <lgl> TRUE, FALSE, TRUE, FALSE,~
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/~
$ host_picture_url                             <chr> "https://a0.muscache.com/~
$ host_neighbourhood                           <chr> "Mitte", "Prenzlauer Berg~
$ host_listings_count                          <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_total_listings_count                    <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_verifications                           <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified                       <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood                                <chr> "Berlin, Germany", "Berli~
$ neighbourhood_cleansed                       <chr> "Brunnenstr. Süd", "Prenz~
$ neighbourhood_group_cleansed                 <chr> "Mitte", "Pankow", "Panko~
$ latitude                                     <dbl> 52.53305, 52.53471, 52.54~
$ longitude                                    <dbl> 13.40394, 13.41810, 13.41~
$ property_type                                <chr> "Entire guesthouse", "Ent~
$ room_type                                    <chr> "Entire home/apt", "Entir~
$ accommodates                                 <dbl> 2, 4, 2, 7, 1, 5, 2, 4, 4~
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text                               <chr> "1 bath", "1 bath", "1 sh~
$ bedrooms                                     <dbl> 1, 1, 1, 4, NA, 1, NA, 2,~
$ beds                                         <dbl> 0, 2, 2, 7, 1, 3, 0, 2, 2~
$ amenities                                    <chr> "[\"Refrigerator\", \"Hea~
$ price                                        <chr> "$77.00", "$90.00", "$33.~
$ minimum_nights                               <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ maximum_nights                               <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_minimum_nights                       <dbl> 33, 62, 1, 6, 90, 60, 5, ~
$ maximum_minimum_nights                       <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ minimum_maximum_nights                       <dbl> 1125, 1125, 10, 14, 1125,~
$ maximum_maximum_nights                       <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_nights_avg_ntm                       <dbl> 88.2, 62.0, 1.0, 6.0, 90.~
$ maximum_nights_avg_ntm                       <dbl> 1125.0, 1125.0, 10.0, 14.~
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30                              <dbl> 0, 9, 0, 0, 0, 0, 0, 3, 0~
$ availability_60                              <dbl> 21, 9, 0, 0, 1, 0, 4, 31,~
$ availability_90                              <dbl> 51, 9, 0, 0, 31, 0, 4, 61~
$ availability_365                             <dbl> 326, 93, 0, 0, 102, 144, ~
$ calendar_last_scraped                        <date> 2021-09-22, 2021-09-22, ~
$ number_of_reviews                            <dbl> 143, 147, 293, 8, 26, 48,~
$ number_of_reviews_ltm                        <dbl> 10, 1, 0, 0, 1, 0, 21, 2,~
$ number_of_reviews_l30d                       <dbl> 1, 0, 0, 0, 0, 0, 3, 0, 0~
$ first_review                                 <date> 2016-04-11, 2010-12-21, ~
$ last_review                                  <date> 2021-07-22, 2017-03-20, ~
$ review_scores_rating                         <dbl> 4.66, 4.63, 4.83, 5.00, 4~
$ review_scores_accuracy                       <dbl> 4.79, 4.68, 4.85, 5.00, 5~
$ review_scores_cleanliness                    <dbl> 4.52, 4.53, 4.90, 5.00, 4~
$ review_scores_checkin                        <dbl> 4.88, 4.64, 4.86, 5.00, 4~
$ review_scores_communication                  <dbl> 4.89, 4.69, 4.85, 5.00, 4~
$ review_scores_location                       <dbl> 4.96, 4.92, 4.91, 4.86, 4~
$ review_scores_value                          <dbl> 4.59, 4.63, 4.71, 4.86, 4~
$ license                                      <chr> NA, NA, NA, "03/Z/RA/0034~
$ instant_bookable                             <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count               <dbl> 5, 1, 1, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes  <dbl> 5, 1, 0, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month                            <dbl> 2.15, 1.12, 2.40, 0.16, 0~
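As a small illustration of the format issue noted for “host_acceptance_rate”, the sketch below converts percentage strings into numbers using base R. The four input values are copied from the glimpse() output; the variable names `rate_clean` and `rate_num`, and the choice to express the result as a proportion, are assumptions made for this example.

```r
# Values as they appear in the glimpse() output above.
host_acceptance_rate <- c("91%", "100%", "N/A", "0%")

# Treat "N/A" as missing, strip the percent sign, convert to a proportion.
rate_clean <- ifelse(host_acceptance_rate == "N/A", NA, host_acceptance_rate)
rate_num   <- as.numeric(sub("%", "", rate_clean, fixed = TRUE)) / 100

rate_num
```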

2.2 Computing summary statistics of the variables of interest

Using “favstats”, we can get a feel for the values that individual variables take on. We chose “accommodates”, “review_scores_rating”, “number_of_reviews”, and “beds” because our intuitive sense was that these could all impact price in our eventual regression model.

From “favstats”, we learn that the median for accommodates is 2, while the maximum goes up to 16. Also, the average Airbnb rental has a review score of c. 4.6. Finally, there is at least one Airbnb with 17 beds. These are just some example figures from this descriptive analysis that help us get a better feel for the data. Also notice that we cannot yet run the command on “price”, since it is still stored as a character variable.

Using “skim”, we can see that there are certain variables where many values are missing (e.g., host_about). It is good to see that “price”, our dependent variable in the regression model, is not missing for any of the rentals.

Accommodates
min Q1 median Q3 max mean sd n missing
0 2 2 3 16 2.714129 1.619647 18288 0
Review Scores Rating
min Q1 median Q3 max mean sd n missing
0 4.61 4.85 5 5 4.626417 0.8043395 14716 3572
Number of Reviews
min Q1 median Q3 max mean sd n missing
0 1 4 17 655 22.78904 51.01942 18288 0
Beds
min Q1 median Q3 max mean sd n missing
0 1 1 2 17 1.624439 1.244291 18061 227
Skim Summary
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace Date.min Date.max Date.median Date.n_unique logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character listing_url 0 1.0000000 33 37 0 18288 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character name 29 0.9984143 1 255 0 17766 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character description 544 0.9702537 1 1000 0 17156 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighborhood_overview 8702 0.5241689 1 1000 0 8570 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character picture_url 0 1.0000000 60 126 0 18047 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_url 0 1.0000000 38 43 0 14776 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_name 16 0.9991251 1 35 0 5177 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_location 59 0.9967738 1 199 0 952 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_about 9327 0.4899934 1 5095 0 6642 21 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_time 16 0.9991251 3 18 0 5 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_response_rate 16 0.9991251 2 4 0 66 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_acceptance_rate 16 0.9991251 2 4 0 99 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_thumbnail_url 16 0.9991251 55 106 0 14674 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_picture_url 16 0.9991251 57 109 0 14674 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_neighbourhood 6091 0.6669401 1 28 0 165 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character host_verifications 0 1.0000000 2 158 0 318 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood 8702 0.5241689 7 43 0 50 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood_cleansed 0 1.0000000 4 41 0 137 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character neighbourhood_group_cleansed 0 1.0000000 5 24 0 12 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character property_type 0 1.0000000 3 35 0 68 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character room_type 0 1.0000000 10 15 0 4 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character bathrooms_text 26 0.9985783 6 17 0 27 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character amenities 0 1.0000000 2 1416 0 15257 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character price 0 1.0000000 5 9 0 430 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
character license 16019 0.1240704 3 342 0 1921 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Date last_scraped 0 1.0000000 NA NA NA NA NA 2021-09-21 2021-10-03 2021-09-22 4 NA NA NA NA NA NA NA NA NA NA
Date host_since 16 0.9991251 NA NA NA NA NA 2008-08-08 2021-09-20 2015-09-16 3562 NA NA NA NA NA NA NA NA NA NA
Date calendar_last_scraped 0 1.0000000 NA NA NA NA NA 2021-09-21 2021-10-03 2021-09-22 4 NA NA NA NA NA NA NA NA NA NA
Date first_review 3572 0.8046807 NA NA NA NA NA 2010-12-21 2021-09-22 2018-07-10 2771 NA NA NA NA NA NA NA NA NA NA
Date last_review 3572 0.8046807 NA NA NA NA NA 2012-07-08 2021-09-26 2019-09-28 2226 NA NA NA NA NA NA NA NA NA NA
logical host_is_superhost 16 0.9991251 NA NA NA NA NA NA NA NA NA 0.1545534 FAL: 15448, TRU: 2824 NA NA NA NA NA NA NA NA
logical host_has_profile_pic 16 0.9991251 NA NA NA NA NA NA NA NA NA 0.9948555 TRU: 18178, FAL: 94 NA NA NA NA NA NA NA NA
logical host_identity_verified 16 0.9991251 NA NA NA NA NA NA NA NA NA 0.7887478 TRU: 14412, FAL: 3860 NA NA NA NA NA NA NA NA
logical bathrooms 18288 0.0000000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical calendar_updated 18288 0.0000000 NA NA NA NA NA NA NA NA NA NaN : NA NA NA NA NA NA NA NA
logical has_availability 0 1.0000000 NA NA NA NA NA NA NA NA NA 0.9803696 TRU: 17929, FAL: 359 NA NA NA NA NA NA NA NA
logical instant_bookable 0 1.0000000 NA NA NA NA NA NA NA NA NA 0.3035324 FAL: 12737, TRU: 5551 NA NA NA NA NA NA NA NA
numeric id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.557156e+07 1.540011e+07 2.015000e+03 1.218794e+07 2.385470e+07 3.968697e+07 5.238006e+07 ▇▇▇▆▇
numeric scrape_id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.021092e+13 0.000000e+00 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 2.021092e+13 ▁▁▇▁▁
numeric host_id 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 9.337946e+07 1.083088e+08 1.581000e+03 1.194556e+07 4.352120e+07 1.449065e+08 4.238179e+08 ▇▂▁▁▁
numeric host_listings_count 16 0.9991251 NA NA NA NA NA NA NA NA NA NA NA 4.556042e+00 4.036450e+01 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.010000e+03 ▇▁▁▁▁
numeric host_total_listings_count 16 0.9991251 NA NA NA NA NA NA NA NA NA NA NA 4.556042e+00 4.036450e+01 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.010000e+03 ▇▁▁▁▁
numeric latitude 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 5.250997e+01 3.244370e-02 5.234007e+01 5.248953e+01 5.250974e+01 5.253325e+01 5.265611e+01 ▁▁▇▃▁
numeric longitude 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.340509e+01 6.332170e-02 1.309715e+01 1.336797e+01 1.341485e+01 1.343918e+01 1.375736e+01 ▁▂▇▁▁
numeric accommodates 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.714129e+00 1.619647e+00 0.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.600000e+01 ▇▂▁▁▁
numeric bedrooms 1609 0.9120188 NA NA NA NA NA NA NA NA NA NA NA 1.271779e+00 6.272113e-01 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.200000e+01 ▇▁▁▁▁
numeric beds 227 0.9875875 NA NA NA NA NA NA NA NA NA NA NA 1.624439e+00 1.244291e+00 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.700000e+01 ▇▁▁▁▁
numeric minimum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 9.324256e+00 3.423886e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 ▇▁▁▁▁
numeric maximum_nights 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 5.883037e+02 5.282968e+02 1.000000e+00 2.800000e+01 3.650000e+02 1.125000e+03 5.000000e+03 ▇▇▁▁▁
numeric minimum_minimum_nights 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 9.220867e+00 3.415314e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 ▇▁▁▁▁
numeric maximum_minimum_nights 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 9.878657e+00 3.535246e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 ▇▁▁▁▁
numeric minimum_maximum_nights 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 4.704186e+05 3.175798e+07 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
numeric maximum_maximum_nights 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 5.878616e+05 3.550553e+07 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
numeric minimum_nights_avg_ntm 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 9.597167e+00 3.451717e+01 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 1.124000e+03 ▇▁▁▁▁
numeric maximum_nights_avg_ntm 1 0.9999453 NA NA NA NA NA NA NA NA NA NA NA 5.875923e+05 3.548948e+07 1.000000e+00 3.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
numeric availability_30 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 3.806758e+00 7.721591e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 3.000000e+01 ▇▁▁▁▁
numeric availability_60 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.019412e+01 1.775509e+01 0.000000e+00 0.000000e+00 0.000000e+00 1.500000e+01 6.000000e+01 ▇▁▁▁▁
numeric availability_90 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.831753e+01 2.927484e+01 0.000000e+00 0.000000e+00 0.000000e+00 3.500000e+01 9.000000e+01 ▇▁▁▁▁
numeric availability_365 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 8.556086e+01 1.245070e+02 0.000000e+00 0.000000e+00 0.000000e+00 1.620000e+02 3.650000e+02 ▇▁▁▁▂
numeric number_of_reviews 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.278904e+01 5.101942e+01 0.000000e+00 1.000000e+00 4.000000e+00 1.700000e+01 6.550000e+02 ▇▁▁▁▁
numeric number_of_reviews_ltm 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 2.679899e+00 9.356744e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 4.470000e+02 ▇▁▁▁▁
numeric number_of_reviews_l30d 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 4.519904e-01 1.654162e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.040000e+02 ▇▁▁▁▁
numeric review_scores_rating 3572 0.8046807 NA NA NA NA NA NA NA NA NA NA NA 4.626417e+00 8.043395e-01 0.000000e+00 4.610000e+00 4.850000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_accuracy 3897 0.7869094 NA NA NA NA NA NA NA NA NA NA NA 4.791855e+00 4.112054e-01 0.000000e+00 4.750000e+00 4.920000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_cleanliness 3895 0.7870188 NA NA NA NA NA NA NA NA NA NA NA 4.637258e+00 5.258907e-01 0.000000e+00 4.500000e+00 4.800000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_checkin 3909 0.7862533 NA NA NA NA NA NA NA NA NA NA NA 4.826007e+00 3.900019e-01 0.000000e+00 4.800000e+00 4.960000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_communication 3898 0.7868548 NA NA NA NA NA NA NA NA NA NA NA 4.828607e+00 3.978867e-01 0.000000e+00 4.810000e+00 4.970000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_location 3908 0.7863080 NA NA NA NA NA NA NA NA NA NA NA 4.759599e+00 3.838505e-01 0.000000e+00 4.670000e+00 4.880000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric review_scores_value 3910 0.7861986 NA NA NA NA NA NA NA NA NA NA NA 4.668290e+00 4.501632e-01 0.000000e+00 4.550000e+00 4.760000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
numeric calculated_host_listings_count 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 3.025153e+00 7.454440e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 7.600000e+01 ▇▁▁▁▁
numeric calculated_host_listings_count_entire_homes 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.942257e+00 5.416078e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 4.400000e+01 ▇▁▁▁▁
numeric calculated_host_listings_count_private_rooms 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 8.859361e-01 3.247792e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 4.500000e+01 ▇▁▁▁▁
numeric calculated_host_listings_count_shared_rooms 0 1.0000000 NA NA NA NA NA NA NA NA NA NA NA 1.392170e-01 2.017200e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.800000e+01 ▇▁▁▁▁
numeric reviews_per_month 3572 0.8046807 NA NA NA NA NA NA NA NA NA NA NA 8.155416e-01 1.577983e+00 1.000000e-02 9.000000e-02 3.000000e-01 1.000000e+00 9.086000e+01 ▇▁▁▁▁
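To make the summary columns above more concrete, the following base R sketch recomputes the same kind of figures on a toy data frame. It is not the code used for the report (which relies on favstats and skim); the data frame `summary_toy`, its values, and the helper `fav_summary()` are assumptions introduced purely for illustration.

```r
# Toy data with a few missing values (illustrative, not the Berlin listings).
summary_toy <- data.frame(
  accommodates         = c(2, 4, 2, 7, 1, 5),
  beds                 = c(0, 2, 2, 7, 1, NA),
  review_scores_rating = c(4.66, 4.63, NA, 5.00, NA, 4.83)
)

# favstats-style summary for one numeric vector, ignoring missing values.
fav_summary <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  data.frame(min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
             mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
             n = sum(!is.na(x)), missing = sum(is.na(x)))
}

beds_summary <- fav_summary(summary_toy$beds)

# Missing values and completion rate per column, as in the skim summary.
n_missing     <- colSums(is.na(summary_toy))
complete_rate <- 1 - n_missing / nrow(summary_toy)
```

The `n_missing` and `complete_rate` vectors correspond to the columns of the same name in the skim summary, which is how we spot sparsely populated variables such as host_about.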

In the next step, we transform “price” and some of the other variables into numerics. We also use “ggpairs” to get a feel for the correlation between some of the variables. For instance, it is interesting to find out whether “accommodates” correlates with “minimum_nights”. Our intuition was that very large Airbnbs may have a higher minimum_nights number, since the cleaning effort for the host is greater.
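A minimal sketch of the price conversion, assuming the strings follow the format shown in the glimpse() output. The first two values are taken from that output; the third, with a thousands separator, is an assumed example added to show why commas are stripped as well.

```r
# Price strings; the last value is an assumed example, not from the data.
price_chr <- c("$77.00", "$90.00", "$1,250.00")

# Drop the dollar sign and any commas, then convert to numeric.
price_num <- as.numeric(gsub("[$,]", "", price_chr))

price_num
```

`readr::parse_number()` would be a tidyverse alternative for the same task.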

The ggpairs output indicates that this intuition is not confirmed by the data, since there is actually a slightly negative correlation between minimum_nights and accommodates. As one would expect, a higher number of accommodates is correlated with a higher price. The density plots also help us see that, for example, review_scores_rating is left-skewed, with a large number of rentals having very high ratings. Another interesting observation is that maximum_nights has a peak at 365, which means that many rentals cannot be booked for more than a year. This may be due to regulatory reasons, which keep hosts from renting out their properties for very long periods of time.
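The numbers behind such a pairs plot can be sketched with base R's `cor()`. The data frame `corr_toy` and its values are invented for illustration and only mimic the direction of the relationships described above; they are not the Berlin data.

```r
# Toy numeric columns (illustrative values only).
corr_toy <- data.frame(
  price          = c(40, 55, 60, 90, 120, 200),
  accommodates   = c(1, 2, 2, 4, 5, 8),
  minimum_nights = c(5, 3, 4, 2, 2, 1)
)

# Pairwise Pearson correlations; rows with missing values are
# dropped pair by pair rather than for the whole data frame.
corr <- cor(corr_toy, use = "pairwise.complete.obs")

round(corr, 2)
```

In this toy example, price rises with accommodates, while minimum_nights falls as accommodates grows, which is the same sign pattern reported above.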

These are some other questions that we can now answer:

How many variables/columns? How many rows/observations?

There are 74 variables and 18,288 observations.

Which variables are numbers?

The following variables are numbers: id, scrape_id, host_id, latitude, longitude, accommodates, bathrooms (entirely missing in this data set), bedrooms, beds, price (once converted from character), maximum_nights, minimum_nights, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, reviews_per_month, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, and calculated_host_listings_count_shared_rooms.

Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?

The following variables are factors: host_response_rate, host_acceptance_rate, host_is_superhost, host_has_profile_pic, host_identity_verified, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and instant_bookable.

What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

We did not observe a strong correlation between any of the variables we selected for testing. It therefore appears that there is no strong linear relationship between price and accommodates, number of reviews, review scores rating, or maximum and minimum nights. We log-transform the price variable at a later stage in order to reduce the higher dispersion among very expensive rentals.
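The effect of the log transformation can be illustrated with a handful of made-up prices. The values and the use of the natural logarithm are assumptions for this sketch; the source does not state which base is used.

```r
# Toy prices with one very expensive rental (illustrative values only).
price     <- c(30, 45, 60, 80, 120, 400, 2500)
log_price <- log(price)  # natural log (assumed base)

# How far the most expensive rental sits from the median,
# before and after the transformation.
ratio_raw <- max(price) / median(price)
ratio_log <- max(log_price) / median(log_price)

c(raw = ratio_raw, log = ratio_log)
```

On the raw scale, the most expensive rental is many times the median, whereas on the log scale it is less than twice the median, so extreme prices have far less leverage.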

We are now at the third step of the Exploratory Data Analysis section.

2.3 Creating informative visualizations

In this step, we plot some graphs in order to deepen our understanding of how different variables are distributed. We do not exclusively focus on variables and relationships that may impact price in our regression model, but rather try to get a feel of the dataset in general.

In the first chart, we learn that the distribution of beds varies with the number of guests a rental accommodates. This is a rather straightforward relationship, but it helps to start with something that confirms the intuition. In general, the interquartile range increases with the number of guests accommodated per Airbnb. One can assume that this is due to extra beds in the form of sofa beds, which are likely more frequent in larger rentals. These more “improvised” beds are less likely to be found in smaller rentals.

The second chart tells us that superhosts (experienced hosts who meet Airbnb's criteria for ratings, responsiveness, and reliability) have a higher median review rating and a smaller interquartile range. One can assume that superhosts more consistently provide a high-quality rental experience and therefore the spread of ratings is smaller; note that a high rating is itself part of what qualifies a host for this status. We can also see that there are certain rentals for which the data set does not provide information on host status (“NA”).

The third chart shows that review ratings among different room types vary. Shared rooms tend to have the worst ratings, which is likely due to the fact that the rental experience is dependent on another visitor.

The fourth chart shows the availability of rentals in different neighbourhoods. For example, in Mitte the availability is a lot lower than in Spandau. This is likely due to the fact that Mitte is in a very central location, where the demand for Airbnbs is very high.

For the fifth chart, we filter out all the rentals that have a price >400 to avoid the distorting effect of very expensive rentals. In a later step, we will log-transform the price variable to achieve this. For now, the chart tells us that different room types have different price distributions. The hotel room category, where you also pay for using the amenities of the respective hotel, is unsurprisingly the most expensive one. What is more interesting is that shared rooms and private rooms have very similar distributions. One reason could be that shared rooms are a lot larger, which makes up for the lack of privacy in terms of price.

From the sixth chart, we learn that whether a host has a profile picture seems to impact the communication rating for a specific rental. Hosts that have a picture tend to score higher in this category. After all, Airbnb customers seem to like to see who their host is and incorporate that into the communication rating they give.

Now, we focus on getting our data set in the right format for our regression analysis.

First, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. The four most common property types are entire rental units (~48.0%), private rooms in rental units (~35.7%), entire condominiums (~2.7%), and entire serviced apartments (~2.0%). Together, these property types make up ~88.4% of the whole sample.

Property Type Overview
property_type count prop_in_percentage
Entire rental unit 8778 47.9986877
Private room in rental unit 6534 35.7283465
Entire condominium (condo) 485 2.6520122
Entire serviced apartment 362 1.9794401
Entire loft 327 1.7880577
Private room in residential home 237 1.2959318
Private room in condominium (condo) 219 1.1975066
Entire residential home 183 1.0006562
Room in hotel 175 0.9569116
Shared room in rental unit 117 0.6397638
Room in boutique hotel 96 0.5249344
Private room in loft 80 0.4374453
Shared room in hostel 75 0.4101050
Private room in bed and breakfast 68 0.3718285
Entire guesthouse 56 0.3062117
Private room in townhouse 55 0.3007437
Private room in hostel 48 0.2624672
Room in serviced apartment 47 0.2569991
Entire guest suite 32 0.1749781
Room in aparthotel 31 0.1695101
Private room in serviced apartment 29 0.1585739
Entire bungalow 25 0.1367017
Entire townhouse 24 0.1312336
Houseboat 18 0.0984252
Private room 18 0.0984252
Private room in guest suite 13 0.0710849
Private room in pension 13 0.0710849
Entire place 10 0.0546807
Boat 9 0.0492126
Camper/RV 9 0.0492126
Private room in guesthouse 9 0.0492126
Room in hostel 9 0.0492126
Private room in villa 8 0.0437445
Tiny house 8 0.0437445
Entire villa 7 0.0382765
Entire cabin 6 0.0328084
Private room in casa particular 6 0.0328084
Private room in tiny house 5 0.0273403
Shared room in condominium (condo) 5 0.0273403
Entire cottage 4 0.0218723
Private room in boat 4 0.0218723
Private room in bungalow 4 0.0218723
Shared room in boutique hotel 4 0.0218723
Shared room in loft 3 0.0164042
Shared room in residential home 3 0.0164042
Entire home/apt 2 0.0109361
Private room in cottage 2 0.0109361
Room in bed and breakfast 2 0.0109361
Shared room in bed and breakfast 2 0.0109361
Shared room in serviced apartment 2 0.0109361
Shared room in tiny house 2 0.0109361
Treehouse 2 0.0109361
Bus 1 0.0054681
Casa particular 1 0.0054681
Castle 1 0.0054681
Earth house 1 0.0054681
Entire chalet 1 0.0054681
Floor 1 0.0054681
Island 1 0.0054681
Private room in cave 1 0.0054681
Private room in floor 1 0.0054681
Private room in houseboat 1 0.0054681
Private room in tipi 1 0.0054681
Shared room 1 0.0054681
Shared room in boat 1 0.0054681
Shared room in cabin 1 0.0054681
Shared room in townhouse 1 0.0054681
Shipping container 1 0.0054681
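The shares of the four most common property types can be checked directly from the counts in the table. This is a small arithmetic sketch; the variable names are chosen for illustration, and the total of 18,288 rows is the figure reported by glimpse().

```r
# Counts copied from the property type table above.
top_four <- c(
  "Entire rental unit"          = 8778,
  "Private room in rental unit" = 6534,
  "Entire condominium (condo)"  = 485,
  "Entire serviced apartment"   = 362
)
n_listings <- 18288  # total number of rows reported by glimpse()

share_pct       <- 100 * top_four / n_listings
total_share_pct <- sum(share_pct)

round(share_pct, 1)
round(total_share_pct, 1)
```

The four types account for 16,159 of the 18,288 listings, or roughly 88.4% of the sample.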

Since the vast majority of the observations in the data belong to one of the top four property types, we create a simplified version of the property_type variable that has 5 categories: the top four categories and Other.
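One way to build such a simplified variable is sketched below in base R. The toy data frame `props_toy` is an assumption for illustration; the column name prop_type_simplified matches the one shown in the check that follows. A dplyr version would typically use `mutate()` with `case_when()`.

```r
# Toy data frame with a mix of common and rare property types.
props_toy <- data.frame(
  property_type = c("Entire rental unit", "Entire loft",
                    "Private room in rental unit", "Houseboat",
                    "Entire condominium (condo)", "Entire serviced apartment"),
  stringsAsFactors = FALSE
)

top_types <- c("Entire rental unit", "Private room in rental unit",
               "Entire condominium (condo)", "Entire serviced apartment")

# Keep the four most common types and lump everything else into "Other".
props_toy$prop_type_simplified <- ifelse(
  props_toy$property_type %in% top_types,
  props_toy$property_type,
  "Other"
)

table(props_toy$prop_type_simplified)
```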

We can quickly check if the simplification worked.

Simplification of Property Type
property_type prop_type_simplified n
Entire rental unit Entire rental unit 8778
Private room in rental unit Private room in rental unit 6534
Entire condominium (condo) Entire condominium (condo) 485
Entire serviced apartment Entire serviced apartment 362
Entire loft Other 327
Private room in residential home Other 237
Private room in condominium (condo) Other 219
Entire residential home Other 183
Room in hotel Other 175
Shared room in rental unit Other 117
Room in boutique hotel Other 96
Private room in loft Other 80
Shared room in hostel Other 75
Private room in bed and breakfast Other 68
Entire guesthouse Other 56
Private room in townhouse Other 55
Private room in hostel Other 48
Room in serviced apartment Other 47
Entire guest suite Other 32
Room in aparthotel Other 31
Private room in serviced apartment Other 29
Entire bungalow Other 25
Entire townhouse Other 24
Houseboat Other 18
Private room Other 18
Private room in guest suite Other 13
Private room in pension Other 13
Entire place Other 10
Boat Other 9
Camper/RV Other 9
Private room in guesthouse Other 9
Room in hostel Other 9
Private room in villa Other 8
Tiny house Other 8
Entire villa Other 7
Entire cabin Other 6
Private room in casa particular Other 6
Private room in tiny house Other 5
Shared room in condominium (condo) Other 5
Entire cottage Other 4
Private room in boat Other 4
Private room in bungalow Other 4
Shared room in boutique hotel Other 4
Shared room in loft Other 3
Shared room in residential home Other 3
Entire home/apt Other 2
Private room in cottage Other 2
Room in bed and breakfast Other 2
Shared room in bed and breakfast Other 2
Shared room in serviced apartment Other 2
Shared room in tiny house Other 2
Treehouse Other 2
Bus Other 1
Casa particular Other 1
Castle Other 1
Earth house Other 1
Entire chalet Other 1
Floor Other 1
Island Other 1
Private room in cave Other 1
Private room in floor Other 1
Private room in houseboat Other 1
Private room in tipi Other 1
Shared room Other 1
Shared room in boat Other 1
Shared room in cabin Other 1
Shared room in townhouse Other 1
Shipping container Other 1

Next, we look at the minimum_nights variable so that our regression analysis only includes listings that are intended for travel purposes. First, we check the distribution of minimum_nights.

Minimum Nights Data
minimum_nights count
2 4236
1 4194
3 3282
4 1368
5 1293
7 864
30 418
6 382
14 363
60 298
10 284
90 195
20 117
28 95
15 82
8 75
21 72
180 53
12 43
9 39
25 37
13 33
29 32
61 31
62 28
22 26
120 23
31 17
183 16
45 14
93 13
18 11
150 11
16 10
89 10
91 10
40 9
58 9
100 9
357 9
11 7
19 7
50 7
56 7
365 7
23 6
27 6
181 6
65 5
92 5
200 5
1000 5
17 4
55 4
63 4
70 4
80 4
85 4
300 4
24 3
42 3
59 3
99 3
118 3
182 3
186 3
500 3
1124 3
26 2
33 2
35 2
83 2
84 2
140 2
185 2
240 2
360 2
34 1
37 1
48 1
49 1
51 1
71 1
75 1
82 1
87 1
88 1
98 1
101 1
105 1
119 1
122 1
125 1
128 1
129 1
170 1
179 1
184 1
187 1
188 1
210 1
250 1
270 1
304 1
355 1
356 1
720 1
1100 1

We can now answer some more questions:

What are the most common values for the variable minimum_nights?

The most common values for the variable minimum_nights are 2, 1, and 3 nights. This makes sense: many people use Airbnb for city trips, so most hosts keep the minimum duration short. At the same time, the cost and effort of cleaning an Airbnb for a one-night booking might not be worth it for many hosts, which may explain why a 2-night minimum is slightly more common than a 1-night minimum.

Is there any value among the common values that stands out?

At first glance, the minimum limits of 30, 14, and 60 nights stand out. These are usually longer-term Airbnbs that are used by interns or by workers on assembly assignments. It is also logical for some landlords to rent out their rooms for longer periods, since the room only has to be tidied once per stay. The highest minimum night requirement is 1,124 nights. This observation would have to be investigated further to understand the reason behind such a high value.

What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

The usual reason for these longer minimum stays is to attract bookings from people who are on work projects or internships, or who need a temporary stay while looking for permanent accommodation. The benefit for the host is that the rooms have to be cleaned and set up less frequently.

Next, we filter the airbnb data so that it only includes observations with minimum_nights <= 4.
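The filter can be sketched on a toy data frame as follows. The name listings matches the map code in this report; the toy values and the name listings_travel are made up for illustration.

```r
# Toy stand-in for the listings data; the values are illustrative only
listings <- data.frame(
  id             = 1:6,
  minimum_nights = c(1, 2, 4, 5, 30, 1124)
)

# Keep only listings that can be booked for a 4-night stay
listings_travel <- subset(listings, minimum_nights <= 4)

nrow(listings_travel)  # 3 rows remain in this toy example
```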

After making these adjustments, we want to analyze the distribution of rentals in Berlin. As the map below shows, certain quarters have particularly many rentals. For instance, there are many rentals available in Kreuzberg (a quarter in the southern part of the city). This may be due to the types of buildings and the general infrastructure in the area. Kreuzberg is home to many restaurants and bars, which makes it an interesting area for tourists. Interestingly, there are fewer Airbnbs in the heart of the city. This is likely because the government district and many high-end hotels are located here, which leaves less room for Airbnbs.

# Data visualization that places each rental on the map using its longitude and latitude
leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 0.5, 
                   fillColor = "red", 
                   fillOpacity = 0.3, 
                   popup = ~listing_url,
                   label = ~property_type)

As we get closer to our regression model, we create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
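The exact formula is not shown in this section, so the following is only a sketch under two assumptions: that price is stored as a nightly price string such as "$1,200.00", and that the nightly price is split evenly across the guests a listing accommodates before scaling to 2 people and 4 nights (in line with the later remark that we divided by accommodates). The helper column price_num and all values are hypothetical.

```r
# Toy data; the column names follow the report, the values are made up
listings <- data.frame(
  price        = c("$50.00", "$1,200.00", "$90.00"),
  accommodates = c(2, 6, 3),
  stringsAsFactors = FALSE
)

# Strip the currency symbol and thousands separator, then convert to numeric
listings$price_num <- as.numeric(gsub("[$,]", "", listings$price))

# ASSUMPTION: per-person nightly price, scaled to 2 people and 4 nights
listings$price_4_nights <- listings$price_num / listings$accommodates * 2 * 4

listings$price_4_nights  # 200, 1600, 240
```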

In the next step, we create a new column for log(price_4_nights). We use the log because price_4_nights contains some outlier rentals, and the log-transformation could help to normalize the distribution. Inference in a linear regression model assumes normally distributed errors with constant variance. A log-transformation of the response typically helps to come closer to both assumptions, although it does not guarantee that they are met.

We can use histograms to examine the distributions of price_4_nights and log(price_4_nights).
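As an illustration of what the two histograms compare, the sketch below uses simulated log-normal prices (not the Berlin data) and a hand-written sample skewness function. The parameters of the simulation are arbitrary assumptions.

```r
set.seed(42)

# Simulated right-skewed prices (log-normal); not the report's data
price_4_nights <- rlnorm(5000, meanlog = 5.3, sdlog = 0.5)

# Sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price_4_nights)
skew_log <- skewness(log(price_4_nights))

c(raw = skew_raw, log = skew_log)

# Histogram bins computed without opening a graphics device;
# drop plot = FALSE to draw them
h_raw <- hist(price_4_nights, breaks = 30, plot = FALSE)
h_log <- hist(log(price_4_nights), breaks = 30, plot = FALSE)
```

In this simulation the raw prices have a long right tail, while the logged prices are close to symmetric.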

3 Model Selection and Validation

We now have all variables in the correct format and can start model selection and validation. We start with a model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

3.1 Model 1

                                                  Estimate Std. Error t value
(Intercept)                                      5.2822983  0.0449473 117.522
prop_type_simplifiedEntire rental unit          -0.1427983  0.0318703  -4.481
prop_type_simplifiedEntire serviced apartment    0.4198925  0.0464712   9.036
prop_type_simplifiedOther                       -0.1408665  0.0342009  -4.119
prop_type_simplifiedPrivate room in rental unit -0.5365301  0.0321607 -16.683
number_of_reviews                               -0.0002594  0.0000804  -3.227
review_scores_rating                             0.0426309  0.0068832   6.194
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire rental unit          7.53e-06 ***
prop_type_simplifiedEntire serviced apartment    < 2e-16 ***
prop_type_simplifiedOther                       3.84e-05 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                                0.00126 ** 
review_scores_rating                            6.12e-10 ***

Residual standard error: 0.4882 on 9866 degrees of freedom
Multiple R-squared:  0.1648,    Adjusted R-squared:  0.1643 
F-statistic: 324.4 on 6 and 9866 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.011769  4        1.001464
number_of_reviews    1.015923  1        1.007930
review_scores_rating 1.005692  1        1.002842

Because the dependent variable (i.e., price_4_nights) is log-transformed, the interpretation of the coefficients requires one additional step. The coefficient has to be exponentiated to reverse the log-transformation: (e^0.0426309 - 1) * 100 = 4.3553. This adjusted coefficient means that for every one-unit increase in review_scores_rating, price_4_nights increases by about 4.4%, holding the other variables constant. This makes intuitive sense: the higher the rating, the more the host can charge. The t-value of more than 6 indicates that this relationship is statistically significant.

To interpret the property type coefficients, they have to be transformed in the same way as the coefficient for review_scores_rating. This leads to the following values (in %):

  • prop_type_simplifiedEntire rental unit: -13.3071
  • prop_type_simplifiedEntire serviced apartment: 52.17979
  • prop_type_simplifiedOther: -13.1395
  • prop_type_simplifiedPrivate room in rental unit: -41.5226

The level “Entire condominium (condo)” is taken as the base category. Hence, the coefficients correspond to the percentage change in price_4_nights relative to the base case that the Airbnb is of prop_type “Entire condominium (condo)”. For instance, if you rent an “Entire serviced apartment”, price_4_nights is about 52% higher than if you had rented an “Entire condominium (condo)”, all else equal. The same logic applies to the other property types, which are all statistically significant. It also makes intuitive sense that, for example, “Entire serviced apartments” are significantly more costly, because you pay for amenities such as regular cleaning or even breakfast. In a further analysis, one could split up the “Other” category to find out more about other property types.
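The transformation of the model1 coefficients can be reproduced with a small helper. The function name pct_change and the shortened labels are chosen for illustration; the estimates are copied from the model1 output above.

```r
# Convert a coefficient from a log-price model into a percentage change
pct_change <- function(beta) (exp(beta) - 1) * 100

# Estimates copied from the model1 output
coefs <- c(
  review_scores_rating        = 0.0426309,
  entire_rental_unit          = -0.1427983,
  entire_serviced_apartment   = 0.4198925,
  other                       = -0.1408665,
  private_room_in_rental_unit = -0.5365301
)

round(pct_change(coefs), 2)
```

For small coefficients, the percentage change is close to 100 times the coefficient itself; for larger ones such as 0.42, the exponentiation makes a clear difference.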

Next, we want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
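One way to test a factor such as room_type as a whole, rather than level by level, is a partial F-test between the nested models. This is an alternative to reading the individual t-values and is not necessarily what was done in this report. The sketch uses simulated data, and the names sim, model_small, and model_big are hypothetical.

```r
set.seed(1)
n <- 500

# Simulated listings; not the report's data
sim <- data.frame(
  review_scores_rating = runif(n, 3, 5),
  room_type = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))
)
sim$log_price <- 5.2 + 0.04 * sim$review_scores_rating -
  0.4 * (sim$room_type == "Private room") + rnorm(n, sd = 0.45)

model_small <- lm(log_price ~ review_scores_rating, data = sim)
model_big   <- lm(log_price ~ review_scores_rating + room_type, data = sim)

# Partial F-test: does room_type improve the fit given the other predictor?
f_test <- anova(model_small, model_big)
f_test
```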

3.2 Model 2

Room Types
room_type count
Entire home/apt 5627
Private room 4061
Shared room 93
Hotel room 92
                                                  Estimate Std. Error t value
(Intercept)                                      5.319e+00  4.328e-02 122.882
prop_type_simplifiedEntire rental unit          -1.433e-01  3.068e-02  -4.673
prop_type_simplifiedEntire serviced apartment    4.187e-01  4.473e-02   9.361
prop_type_simplifiedOther                        3.023e-02  3.728e-02   0.811
prop_type_simplifiedPrivate room in rental unit -2.549e-01  4.317e-02  -5.905
number_of_reviews                               -2.411e-04  7.746e-05  -3.112
review_scores_rating                             3.479e-02  6.631e-03   5.247
room_typeHotel room                              6.465e-01  5.384e-02  12.009
room_typePrivate room                           -2.819e-01  3.015e-02  -9.352
room_typeShared room                            -1.171e+00  5.360e-02 -21.841
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire rental unit          3.01e-06 ***
prop_type_simplifiedEntire serviced apartment    < 2e-16 ***
prop_type_simplifiedOther                        0.41740    
prop_type_simplifiedPrivate room in rental unit 3.65e-09 ***
number_of_reviews                                0.00186 ** 
review_scores_rating                            1.58e-07 ***
room_typeHotel room                              < 2e-16 ***
room_typePrivate room                            < 2e-16 ***
room_typeShared room                             < 2e-16 ***

Residual standard error: 0.4699 on 9863 degrees of freedom
Multiple R-squared:  0.2265,    Adjusted R-squared:  0.2258 
F-statistic: 320.9 on 9 and 9863 DF,  p-value: < 2.2e-16
                          GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 11.354524  4        1.354865
number_of_reviews     1.017942  1        1.008931
review_scores_rating  1.007537  1        1.003761
room_type            11.342129  3        1.498934
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.227e+00  3.220e-02 162.336  < 2e-16 ***
number_of_reviews     -2.131e-04  7.924e-05  -2.690  0.00716 ** 
review_scores_rating   3.202e-02  6.792e-03   4.715 2.46e-06 ***
room_typeHotel room    7.806e-01  5.063e-02  15.418  < 2e-16 ***
room_typePrivate room -3.956e-01  9.958e-03 -39.725  < 2e-16 ***
room_typeShared room  -1.038e+00  5.038e-02 -20.604  < 2e-16 ***

Residual standard error: 0.4816 on 9867 degrees of freedom
Multiple R-squared:  0.1873,    Adjusted R-squared:  0.1869 
F-statistic: 454.7 on 5 and 9867 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.014175  1        1.007062
review_scores_rating 1.006402  1        1.003196
room_type            1.010664  3        1.001770

There is multicollinearity between room_type and prop_type_simplified, as one would expect: both have a GVIF above 11 when they are included together. Because the model with room_type yields a higher adjusted R-squared (0.1869) than the model with prop_type_simplified (0.1643), we exclude prop_type_simplified from the model. All room_type coefficients are statistically significant and tell us different things about price_4_nights:

  1. Hotel rooms increase the price for an Airbnb over the base case scenario that an entire home is rented (the excluded variable). This makes sense, since the tenants also pay for the additional hotel infrastructure that they get to use.
  2. Renting a private room reduces the price of the rental compared to the base case. This also makes intuitive sense, since these rooms may be in the host's own house or otherwise less valuable than an entire apartment.
  3. Renting a shared room also reduces price, which makes sense since the cost of the rental is split among a greater number of heads.

We now add other variables to the model to increase its explanatory power. Currently, our model explains only c. 19% of the variation in the log-transformed price. Model3 therefore adds the number of bedrooms, the number of beds, and the capacity of the rental (accommodates).

3.3 Model 3

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.404e+00  3.533e-02 152.975  < 2e-16 ***
number_of_reviews      8.847e-05  7.944e-05   1.114    0.265    
review_scores_rating   2.835e-02  6.977e-03   4.063 4.89e-05 ***
room_typeHotel room    7.528e-01  4.879e-02  15.429  < 2e-16 ***
room_typePrivate room -4.816e-01  1.064e-02 -45.262  < 2e-16 ***
room_typeShared room  -8.967e-01  4.964e-02 -18.065  < 2e-16 ***
bedrooms               1.729e-01  1.098e-02  15.737  < 2e-16 ***
beds                   9.605e-03  5.892e-03   1.630    0.103    
accommodates          -1.239e-01  5.405e-03 -22.922  < 2e-16 ***

Residual standard error: 0.4581 on 9058 degrees of freedom
Multiple R-squared:  0.2611,    Adjusted R-squared:  0.2605 
F-statistic: 400.2 on 8 and 9058 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.032268  1        1.016006
review_scores_rating 1.007340  1        1.003664
room_type            1.254522  3        1.038516
bedrooms             2.279887  1        1.509930
beds                 3.014537  1        1.736242
accommodates         3.928598  1        1.982069

Based on this model, we learn that bedrooms and the capacity of the rental (accommodates) are significant predictors of price_4_nights, which can be seen from their high absolute t-statistics. As the nr. of bedrooms increases, the price of the rental also increases. As capacity increases, the price per person actually decreases (remember that we divided by “accommodates” when adjusting the price_4_nights variable). This makes sense, since the price is then shared among a greater number of guests. Beds is not a statistically significant predictor of price_4_nights. Interestingly, there is some multicollinearity between bedrooms, beds, and accommodates (GVIF between 2.3 and 3.9), but not enough to disregard the model.
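For numeric predictors with one degree of freedom, the GVIF equals the ordinary variance inflation factor, which can be computed by hand: regress the predictor on the others and take 1 / (1 - R^2). The sketch below uses simulated, deliberately correlated size variables (not the Berlin data), and vif_manual is a hypothetical helper.

```r
set.seed(7)
n <- 1000

# Simulated, deliberately correlated size variables; not the report's data
accommodates <- sample(1:8, n, replace = TRUE)
beds         <- pmax(1, round(accommodates / 2 + rnorm(n, sd = 0.7)))
bedrooms     <- pmax(1, round(beds / 1.5 + rnorm(n, sd = 0.5)))

# VIF of one numeric predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_beds <- vif_manual(beds, data.frame(accommodates, bedrooms))
vif_beds
```

A value of 1 means the predictor is uncorrelated with the others; the larger the value, the more the standard error of its coefficient is inflated.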

Comparing Model3 to Model2, the adjusted R-squared increases to 0.26, which means that we can now explain more than a quarter of the variation in the log-transformed price. In Model4, we add the superhost variable (host_is_superhost) and check whether superhosts can command a pricing premium after controlling for the other variables.

3.4 Model 4

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.426e+00  3.529e-02 153.769  < 2e-16 ***
number_of_reviews     -1.552e-04  8.479e-05  -1.831  0.06717 .  
review_scores_rating   2.191e-02  6.998e-03   3.131  0.00175 ** 
room_typeHotel room    7.441e-01  4.859e-02  15.314  < 2e-16 ***
room_typePrivate room -4.824e-01  1.060e-02 -45.508  < 2e-16 ***
room_typeShared room  -8.731e-01  4.952e-02 -17.631  < 2e-16 ***
bedrooms               1.733e-01  1.094e-02  15.840  < 2e-16 ***
beds                   9.311e-03  5.867e-03   1.587  0.11255    
accommodates          -1.253e-01  5.385e-03 -23.272  < 2e-16 ***
host_is_superhostTRUE  1.028e-01  1.308e-02   7.866 4.09e-15 ***

Residual standard error: 0.4561 on 9048 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2667,    Adjusted R-squared:  0.266 
F-statistic: 365.7 on 9 and 9048 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.186099  1        1.089082
review_scores_rating 1.022240  1        1.011059
room_type            1.260387  3        1.039323
bedrooms             2.280320  1        1.510073
beds                 3.014382  1        1.736198
accommodates         3.933709  1        1.983358
host_is_superhost    1.190880  1        1.091274

Based on this model, superhosts charge a pricing premium, which can be seen from the positive coefficient and the high t-statistic. This makes sense, since these kinds of hosts are typically very professional in the way that they manage their apartments, which translates into higher customer value and thereby the ability to charge higher prices.

For Model5, we include the fact that some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t.

3.5 Model 5

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.4034933  0.0354634 152.368  < 2e-16 ***
number_of_reviews     -0.0002050  0.0000851  -2.409 0.016037 *  
review_scores_rating   0.0241972  0.0069981   3.458 0.000547 ***
room_typeHotel room    0.7049356  0.0489961  14.388  < 2e-16 ***
room_typePrivate room -0.4832893  0.0105825 -45.669  < 2e-16 ***
room_typeShared room  -0.8717889  0.0494345 -17.635  < 2e-16 ***
bedrooms               0.1750736  0.0109243  16.026  < 2e-16 ***
beds                   0.0092061  0.0058568   1.572 0.116021    
accommodates          -0.1274219  0.0053890 -23.645  < 2e-16 ***
host_is_superhostTRUE  0.0998504  0.0130642   7.643 2.34e-14 ***
instant_bookableTRUE   0.0589553  0.0104401   5.647 1.68e-08 ***

Residual standard error: 0.4553 on 9047 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2693,    Adjusted R-squared:  0.2685 
F-statistic: 333.4 on 10 and 9047 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.198945  1        1.094963
review_scores_rating 1.025670  1        1.012754
room_type            1.286166  3        1.042836
bedrooms             2.282299  1        1.510728
beds                 3.014412  1        1.736206
accommodates         3.952343  1        1.988050
host_is_superhost    1.192851  1        1.092177
instant_bookable     1.056667  1        1.027943

As can be seen from the summary statistics, the variable “instant_bookable” is also a significant predictor of price_4_nights. The regression analysis reveals that when controlling for the other listed variables, a rental with an instant-booking option is c. 6.07% more expensive than one without. The customer pays a premium for instant confirmation that the rental can be booked. The t-statistic for the variable is high and there is little multicollinearity with other variables, which is why it should be kept in the model.

For Model6, we look at neighbourhoods. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in the model. Instead, we manipulate the neighbourhood_group_cleansed variable and divide neighbourhoods into the following 4 groups:

  • City West: Steglitz - Zehlendorf, Spandau, Charlottenburg-Wilm.
  • City North: Reinickendorf, Pankow, Lichtenberg
  • City Central: Mitte, Friedrichshain-Kreuzberg
  • City East: Marzahn - Hellersdorf, Treptow - Köpenick, Neukölln, Tempelhof - Schöneberg

This grouping is based on (i) the geographic location of the neighbourhoods and (ii) the judgement of a Berlin local. It gives special consideration to the particularly sought-after quarters of “Mitte” and “Friedrichshain-Kreuzberg”, which form their own group.
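A grouping like this can be implemented with a named lookup vector. The sketch below is one possible way to do it, not necessarily the code used for this report. The column name areas matches the coefficient names in the Model 6 output; area_lookup and the toy listings are hypothetical.

```r
# Lookup from neighbourhood group to area, following the grouping listed above.
# "Charlottenburg-Wilm." is kept exactly as it appears in the report.
area_lookup <- c(
  "Steglitz - Zehlendorf"    = "City West",
  "Spandau"                  = "City West",
  "Charlottenburg-Wilm."     = "City West",
  "Reinickendorf"            = "City North",
  "Pankow"                   = "City North",
  "Lichtenberg"              = "City North",
  "Mitte"                    = "City Central",
  "Friedrichshain-Kreuzberg" = "City Central",
  "Marzahn - Hellersdorf"    = "City East",
  "Treptow - Köpenick"       = "City East",
  "Neukölln"                 = "City East",
  "Tempelhof - Schöneberg"   = "City East"
)

# Toy listings; the values are illustrative only
listings <- data.frame(
  neighbourhood_group_cleansed = c("Mitte", "Pankow", "Marzahn - Hellersdorf", "Spandau"),
  stringsAsFactors = FALSE
)

# Map each listing to its area; factor levels sort alphabetically,
# so "City Central" becomes the reference level in lm()
listings$areas <- factor(unname(area_lookup[listings$neighbourhood_group_cleansed]))

levels(listings$areas)
```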

3.6 Model 6

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.4698939  0.0354801 154.168  < 2e-16 ***
number_of_reviews     -0.0002698  0.0000843  -3.201 0.001375 ** 
review_scores_rating   0.0240103  0.0069220   3.469 0.000525 ***
room_typeHotel room    0.6984246  0.0484656  14.411  < 2e-16 ***
room_typePrivate room -0.4802182  0.0104892 -45.782  < 2e-16 ***
room_typeShared room  -0.9034713  0.0489465 -18.458  < 2e-16 ***
bedrooms               0.1780691  0.0108096  16.473  < 2e-16 ***
beds                   0.0099778  0.0058002   1.720 0.085419 .  
accommodates          -0.1288824  0.0053389 -24.140  < 2e-16 ***
host_is_superhostTRUE  0.0997023  0.0129309   7.710 1.39e-14 ***
instant_bookableTRUE   0.0599617  0.0103274   5.806 6.61e-09 ***
areasCity East        -0.1705299  0.0119988 -14.212  < 2e-16 ***
areasCity North       -0.0903306  0.0125357  -7.206 6.23e-13 ***
areasCity West        -0.0528050  0.0166278  -3.176 0.001500 ** 

Residual standard error: 0.4502 on 9044 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2848 
F-statistic: 278.5 on 13 and 9044 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.203308  1        1.096954
review_scores_rating 1.026409  1        1.013119
room_type            1.297248  3        1.044329
bedrooms             2.285705  1        1.511855
beds                 3.023981  1        1.738960
accommodates         3.967810  1        1.991936
host_is_superhost    1.195353  1        1.093322
instant_bookable     1.057612  1        1.028403
areas                1.024096  3        1.003976

The regression table confirms that neighbourhood is indeed a significant driver of price. City Central is the base category for the analysis and is omitted from the model output. Relative to this base case, all other areas are cheaper. For example, an Airbnb in City East is c. 15.7% less expensive than a comparable apartment in City Central. Taking Berlin’s history into account, this makes sense: the eastern part of the city belonged to the former GDR (DDR), where prices tend to be lower.

For Model7, we include the effect of availability_30 and reviews_per_month on price.

The variable “availability_30” is also a significant predictor of price_4_nights. The t-statistic is very high and the coefficient is positive, which means that, controlling for all the other variables, the impact of availability in the next month on price is positive.

The variable reviews_per_month does not seem to be a significant predictor, as its absolute t-value is less than 2. This makes sense, since the number of reviews per month is not necessarily related to the quality of a property and therefore to its price. A cheap rental could equally well have a high number of reviews per month as a medium-priced or more expensive rental. Therefore, this variable is removed from the final version of model 7, along with “beds”, which also has a t-statistic below 2 and is therefore not statistically significant.

3.7 Model 7

                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.369e+00  3.320e-02 161.707  < 2e-16 ***
number_of_reviews     -5.672e-04  9.376e-05  -6.049 1.51e-09 ***
review_scores_rating   4.052e-02  6.480e-03   6.253 4.20e-10 ***
room_typeHotel room    4.490e-01  4.576e-02   9.813  < 2e-16 ***
room_typePrivate room -4.839e-01  9.807e-03 -49.345  < 2e-16 ***
room_typeShared room  -1.112e+00  4.605e-02 -24.143  < 2e-16 ***
bedrooms               1.931e-01  1.010e-02  19.128  < 2e-16 ***
beds                   2.500e-03  5.416e-03   0.462    0.644    
accommodates          -1.427e-01  4.996e-03 -28.574  < 2e-16 ***
host_is_superhostTRUE  9.383e-02  1.210e-02   7.756 9.74e-15 ***
instant_bookableTRUE   2.094e-02  9.767e-03   2.144    0.032 *  
reviews_per_month      5.388e-03  3.988e-03   1.351    0.177    
availability_30        2.342e-02  6.507e-04  36.000  < 2e-16 ***
areasCity East        -1.698e-01  1.119e-02 -15.173  < 2e-16 ***
areasCity North       -8.749e-02  1.169e-02  -7.483 7.94e-14 ***
areasCity West        -9.605e-02  1.555e-02  -6.176 6.85e-10 ***

Residual standard error: 0.4198 on 9042 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.379, Adjusted R-squared:  0.378 
F-statistic:   368 on 15 and 9042 DF,  p-value: < 2.2e-16
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.368e+00  3.316e-02 161.911  < 2e-16 ***
number_of_reviews     -4.976e-04  7.882e-05  -6.313 2.86e-10 ***
review_scores_rating   4.093e-02  6.471e-03   6.325 2.65e-10 ***
room_typeHotel room    4.461e-01  4.571e-02   9.758  < 2e-16 ***
room_typePrivate room -4.847e-01  9.779e-03 -49.566  < 2e-16 ***
room_typeShared room  -1.111e+00  4.529e-02 -24.535  < 2e-16 ***
bedrooms               1.932e-01  9.989e-03  19.337  < 2e-16 ***
accommodates          -1.412e-01  3.926e-03 -35.977  < 2e-16 ***
host_is_superhostTRUE  9.515e-02  1.206e-02   7.890 3.36e-15 ***
instant_bookableTRUE   2.264e-02  9.684e-03   2.338   0.0194 *  
availability_30        2.359e-02  6.400e-04  36.857  < 2e-16 ***
areasCity East        -1.701e-01  1.119e-02 -15.205  < 2e-16 ***
areasCity North       -8.756e-02  1.169e-02  -7.490 7.53e-14 ***
areasCity West        -9.546e-02  1.553e-02  -6.147 8.23e-10 ***

Residual standard error: 0.4198 on 9044 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.3789,    Adjusted R-squared:  0.378 
F-statistic: 424.4 on 13 and 9044 DF,  p-value: < 2.2e-16
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.209670  1        1.099850
review_scores_rating 1.031352  1        1.015555
room_type            1.304914  3        1.045355
bedrooms             2.244379  1        1.498125
accommodates         2.466464  1        1.570498
host_is_superhost    1.195437  1        1.093360
instant_bookable     1.069316  1        1.034078
availability_30      1.121507  1        1.059012
areas                1.027859  3        1.004590

As the summary statistics above indicate, our final model comprises statistically significant variables only and has an adjusted R-squared of 0.378. This means that the model explains 37.8% of the variation in the log-transformed price. At first, this seems like a mediocre model, since almost 2/3 of the variation in price remains unexplained. However, rental prices depend heavily on the specific location (as opposed to the mere neighborhood), the quality of the amenities, the date of the last renovation, and many other factors, so we consider an R-squared of almost 40% satisfactory. For instance, the addition of the simplified neighborhood variable only added c. 2 percentage points of explanatory power in our analysis. A future investigation should put more emphasis on location and possibly consider factors such as “distance to public transport” or “distance to airport”.

3.8 Model 7 RMSE

In addition to explanatory power, we should also evaluate our model using the root mean squared error (RMSE). This analysis reveals whether the model actually works on unseen data or whether it is overfitted to the specifics of the training data. The results below suggest that model 7 generalizes well, based on two things. First, the rmse_train value (0.4193) is in line with the residual standard error of the model, so the typical deviation between predicted and actual log(price_4_nights) is about 0.42. Second, the difference between rmse_train and rmse_test (c. 0.001) is small, which means that the model is not overfitted: it predicts new data about as well as the training set.

[1] 0.4193265
[1] 0.4200928
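The train/test comparison behind these two numbers can be sketched as follows. The data are simulated, the 75/25 split is an assumption (the split used for this report is not shown), and rmse is a hypothetical helper.

```r
set.seed(123)
n <- 2000

# Simulated data standing in for the listings; not the report's data
sim <- data.frame(
  bedrooms        = sample(1:4, n, replace = TRUE),
  availability_30 = sample(0:30, n, replace = TRUE)
)
sim$log_price <- 5.4 + 0.19 * sim$bedrooms + 0.02 * sim$availability_30 +
  rnorm(n, sd = 0.42)

# ASSUMPTION: a random 75/25 train/test split
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- lm(log_price ~ bedrooms + availability_30, data = train)

# Root mean squared error on the log scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test  <- rmse(test$log_price,  predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

A test RMSE far above the training RMSE would point to overfitting; values that are close to each other, as in the report, do not.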

In a future study, it would be interesting to apply the same model to different cities and test how it performs there. One can hypothesize that in different regions of the world, some variables may have a particularly strong effect on price. For example, in regions that are less safe or more heterogeneous than Berlin, the neighborhood variable may be of greater significance. In this case, the RMSE would reveal that the model must be adapted, because the accuracy on the new data would be a lot lower than on the training data.

3.9 Summary of All Models

To provide an overview of the models that we worked with, we can create a summary table of the important parameters. From this table, we can see that between model 2 and 3, as well as between model 6 and 7, we could increase the explanatory power significantly.

Comparison of models
Standard errors are shown in parentheses below each estimate; "-" marks a variable that is not part of the respective model. The model2 column refers to the version without prop_type_simplified. Values are rounded.

term                                                 model1     model2     model3     model4     model5     model6     model7
(Intercept)                                          5.2823     5.2271     5.4040     5.4265     5.4035     5.4699     5.3684
                                                   (0.0449)   (0.0322)   (0.0353)   (0.0353)   (0.0355)   (0.0355)   (0.0332)
prop_type_simplifiedEntire rental unit              -0.1428          -          -          -          -          -          -
                                                   (0.0319)          -          -          -          -          -          -
prop_type_simplifiedEntire serviced apartment        0.4199          -          -          -          -          -          -
                                                   (0.0465)          -          -          -          -          -          -
prop_type_simplifiedOther                           -0.1409          -          -          -          -          -          -
                                                   (0.0342)          -          -          -          -          -          -
prop_type_simplifiedPrivate room in rental unit     -0.5365          -          -          -          -          -          -
                                                   (0.0322)          -          -          -          -          -          -
number_of_reviews                                 -0.000259  -0.000213   0.000088  -0.000155  -0.000205  -0.000270  -0.000498
                                                 (0.000080) (0.000079) (0.000079) (0.000085) (0.000085) (0.000084) (0.000079)
review_scores_rating                                 0.0426     0.0320     0.0283     0.0219     0.0242     0.0240     0.0409
                                                   (0.0069)   (0.0068)   (0.0070)   (0.0070)   (0.0070)   (0.0069)   (0.0065)
room_typeHotel room                                       -     0.7806     0.7528     0.7441     0.7049     0.6984     0.4461
                                                          -   (0.0506)   (0.0488)   (0.0486)   (0.0490)   (0.0485)   (0.0457)
room_typePrivate room                                     -    -0.3956    -0.4816    -0.4824    -0.4833    -0.4802    -0.4847
                                                          -   (0.0100)   (0.0106)   (0.0106)   (0.0106)   (0.0105)   (0.0098)
room_typeShared room                                      -    -1.0381    -0.8967    -0.8731    -0.8718    -0.9035    -1.1113
                                                          -   (0.0504)   (0.0496)   (0.0495)   (0.0494)   (0.0489)   (0.0453)
bedrooms                                                  -          -     0.1729     0.1733     0.1751     0.1781     0.1932
                                                          -          -   (0.0110)   (0.0109)   (0.0109)   (0.0108)   (0.0100)
beds                                                      -          -     0.0096     0.0093     0.0092     0.0100          -
                                                          -          -   (0.0059)   (0.0059)   (0.0059)   (0.0058)          -
accommodates                                              -          -    -0.1239    -0.1253    -0.1274    -0.1289    -0.1412
                                                          -          -   (0.0054)   (0.0054)   (0.0054)   (0.0053)   (0.0039)
host_is_superhostTRUE                                     -          -          -     0.1028     0.0999     0.0997     0.0952
                                                          -          -          -   (0.0131)   (0.0131)   (0.0129)   (0.0121)
instant_bookableTRUE                                      -          -          -          -     0.0590     0.0600     0.0226
                                                          -          -          -          -   (0.0104)   (0.0103)   (0.0097)
areasCity East                                            -          -          -          -          -    -0.1705    -0.1701
                                                          -          -          -          -          -   (0.0120)   (0.0112)
areasCity North                                           -          -          -          -          -    -0.0903    -0.0876
                                                          -          -          -          -          -   (0.0125)   (0.0117)
areasCity West                                            -          -          -          -          -    -0.0528    -0.0955
                                                          -          -          -          -          -   (0.0166)   (0.0155)
availability_30                                           -          -          -          -          -          -     0.0236
                                                          -          -          -          -          -          -   (0.0006)
#observations                                          9873       9873       9067       9058       9058       9058       9058
R squared                                            0.1648     0.1873     0.2611     0.2667     0.2693     0.2858     0.3789
Adj. R Squared                                       0.1643     0.1869     0.2605     0.2660     0.2685     0.2848     0.3780
Residual SE                                          0.4882     0.4816     0.4581     0.4561     0.4553     0.4502     0.4198

3.10 Test on a Specific Target Listing

We now apply the following criteria to find our target listings:

  • Review score value is higher than 90% of full score

  • The number of reviews is greater than 10

  • It has a private room

  • The host identity is verified and is a super host

  • It is in the neighborhood of Friedrichshain-Kreuzberg

  • There are two beds.

We believe the best model is model 7, as it has the highest adjusted R-squared and the lowest residual SE. We apply model 7 to predict the price and to obtain an interval with a lower and an upper bound. Based on the output below, the predicted prices range from c. 120€ to 310€. However, the “lwr” and “upr” columns tell us that the spread around each estimated price is extremely wide. This should not come as a surprise, since our model explains less than 40% of the variation in price. In the next chapter, we briefly discuss options on how to improve on this in a future analysis.

Price Prediction
fit lwr upr
150.3410 65.94268 342.7586
131.3156 57.63560 299.1862
120.3012 52.79845 274.1063
141.5224 62.08586 322.5952
139.9085 61.40755 318.7621
239.3516 105.02540 545.4796
161.0301 70.67973 366.8762
129.5751 56.86532 295.2539
132.7859 58.28334 302.5236
162.5664 71.35089 370.3926
141.9125 62.28504 323.3386
141.0033 61.88571 321.2685
160.6203 70.49197 365.9832
310.4774 136.19860 707.7623
130.9168 57.46027 298.2791
160.7136 70.54209 366.1484
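
In the table above, the ratio upr/fit equals the ratio fit/lwr in every row, which is what one would expect if the model is fitted on the logarithm of price and the interval is transformed back with exp(). The following is a minimal sketch of that workflow under this assumption. The data are synthetic, and the column names (price_4_nights, accommodates, availability_30) as well as the two target listings are illustrative placeholders, not the variables of the actual model.

```r
# Synthetic stand-in data; all column names and values are illustrative assumptions.
set.seed(42)
n <- 500
listings <- data.frame(
  accommodates    = sample(1:6, n, replace = TRUE),
  availability_30 = sample(0:30, n, replace = TRUE)
)
listings$price_4_nights <- exp(
  4.5 + 0.15 * listings$accommodates + 0.02 * listings$availability_30 +
    rnorm(n, sd = 0.42)
)

# Fit the model on the log scale (assumed here, as a stand-in for model 7).
fit_log <- lm(log(price_4_nights) ~ accommodates + availability_30,
              data = listings)

# Two hypothetical target listings.
targets <- data.frame(accommodates = c(2, 2), availability_30 = c(5, 20))

# 95% prediction interval on the log scale; the result is a matrix with
# the columns fit, lwr and upr.
pred_log <- predict(fit_log, newdata = targets,
                    interval = "prediction", level = 0.95)

# Back-transform to euros. The interval is symmetric on the log scale,
# so it becomes multiplicative (and therefore wide) on the price scale.
pred_eur <- exp(pred_log)
print(round(pred_eur, 2))
```

Because the interval is symmetric only on the log scale, a residual SE of roughly 0.42 translates into an upper bound that is more than twice the point estimate, which matches the width seen in the “lwr” and “upr” columns.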

4 Findings and Recommendation

In this final section, we summarize the results of our selected model and discuss possible steps that could further improve the analysis.

As mentioned in the introduction, the overall goal of this analysis was to find a set of variables that would help us to predict Airbnb rental prices in Berlin. Our final model defines the following variables as statistically significant drivers of said rental prices:

  • Number of reviews
  • Review scores rating
  • Room type
  • Nr. of bedrooms
  • Nr. of people the rental accommodates
  • Host status (superhost/non-superhost)
  • Instant bookability
  • Availability in the next 30 days
  • Location of the rental

The coefficients for each of these variables tell us how rental prices are impacted. For instance, the 1.9 coefficient for “Nr. of bedrooms” tells us that rental prices tend to increase with a higher nr. of bedrooms. The standard error of each coefficient estimate gives us an idea of how far the estimate may be from the “true” value of the coefficient. Whenever the ratio of coefficient to standard error (the t-statistic) exceeds 2 in absolute value, we can be relatively sure that the variable is in fact statistically significant. In our model, this is the case for all variables. The t-statistic closest to 2 is the one for instant bookability (c. 2.338), which means that even this variable is statistically significant at the 5% level.
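
The ratio described above can be recomputed directly from the regression table. The sketch below uses the estimates and standard errors printed in the table, assuming that the right-most value in each row belongs to model 7. The p-values use a normal approximation, which is a simplification that is reasonable for a sample of this size.

```r
# Estimates and standard errors copied from the regression table.
# ASSUMPTION: the right-most value in each table row belongs to model 7.
coefs <- c(
  instant_bookableTRUE = 0.0226432049678445,
  `areasCity East`     = -0.170126099141686,
  `areasCity North`    = -0.08756014347055,
  `areasCity West`     = -0.0954560273732202,
  availability_30      = 0.0235899068419489
)
ses <- c(
  0.00968426946438939,
  0.0111890528020354,
  0.011690098376889,
  0.0155287451989084,
  0.000640043770630418
)

# t-statistic: estimate divided by its standard error.
t_stats <- coefs / ses

# Two-sided p-values under a normal approximation (a simplification;
# summary() on an lm object would use the t-distribution instead).
p_values <- 2 * pnorm(-abs(t_stats))

print(round(t_stats, 3))
print(signif(p_values, 3))
```

The sign of the t-statistic follows the sign of the coefficient, which is why the rule of thumb is applied to the absolute value.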

If we look at the p-value of the overall model (c. 2.2e-16, i.e. 2.2 × 10^-16), we notice that it is extremely small. This means that we can reject, with near certainty, the hypothesis that our predictors jointly have no relationship with rental prices. As previously stated, the adjusted R-squared (adjusted for the nr. of variables) tells us how much of the variation in price can be explained. At 37.8%, this share is substantial, although it leaves most of the variation unexplained.
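
To make the adjustment for the nr. of variables concrete, the sketch below applies the standard formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), to the figures reported for model 7. The value p = 13 (the number of estimated slope coefficients, counting each dummy variable separately) is not stated in the text. It is an assumption, chosen because it reproduces the reported adjusted R-squared.

```r
r2 <- 0.37890488910525  # R squared of model 7, from the regression table
n  <- 9058              # number of observations of model 7, from the table
p  <- 13                # ASSUMPTION: number of slope coefficients (dummies counted separately)

# Penalize R squared for the number of estimated coefficients.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)
```

With more than 9,000 observations the penalty is tiny, which is why R-squared and adjusted R-squared are almost identical for every model in the table.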

Our RMSE analysis also showed us that the model works well on different subgroups of the data. In a further analysis, it would be interesting to apply the model to other cities as well and to compare how the explanatory power changes and whether any variable becomes an insignificant predictor of price.
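
As a reminder of what such an RMSE comparison involves, here is a minimal sketch on synthetic data. It is not the analysis carried out above: the data frame, the 75/25 split, the column names and the baseline area label “City Center” are all hypothetical.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Synthetic stand-in data; names and values are illustrative assumptions.
set.seed(1)
n <- 1000
d <- data.frame(
  accommodates = sample(1:6, n, replace = TRUE),
  area = factor(sample(
    c("City Center", "City East", "City North", "City West"),
    n, replace = TRUE
  ))
)
d$log_price <- 4.5 + 0.15 * d$accommodates -
  0.1 * (d$area == "City East") + rnorm(n, sd = 0.42)

# Hold out 25% of the rows for testing.
train_idx <- sample(n, size = 0.75 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit <- lm(log_price ~ accommodates + area, data = train)

# A similar RMSE on training and test data suggests the model is not overfitted.
rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test  <- rmse(test$log_price, predict(fit, newdata = test))

# RMSE per subgroup (here: per area) on the held-out data.
rmse_by_area <- sapply(
  split(test, test$area),
  function(g) rmse(g$log_price, predict(fit, newdata = g))
)

print(c(train = rmse_train, test = rmse_test))
print(round(rmse_by_area, 3))
```

Applying the same comparison to another city would only require swapping the data frame, provided the same predictor columns are available.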

Additionally, the completeness of the data set could be improved in a further analysis. We had to leave out thousands of rentals because they were missing values for our chosen predictor variables.

Finally, we recommend being aware of the impact of seasonality and weekday on prices. There are certainly some seasons in which demand for rentals is particularly high (e.g., on national holidays or during summertime). The same goes for certain days of the week (e.g., the weekend being in higher demand than weekdays). In a next analysis, we would therefore like to focus on the impact of these time-related variables on the variation in price.

5 Acknowledgements

The data for this project is from insideairbnb.com.